Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/remove duplicates in rcm #13

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

noctillion
Copy link
Contributor

Code updates to refresh the GFF3 release, and code refactoring to prevent duplicate genId entries in the generated transcriptomic raw count matrix.

Copy link

@v-rocheleau v-rocheleau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resulting matrices are good, thanks!
Some small code requests.

Comment on lines 70 to 74
def generate_gene_names(self, url):
file_name = os.path.basename(urlparse(url).path)
self.gene_names = self.download_gff(url, file_name)
self.num_genes = len(self.gene_names)
if not hasattr(self, 'gene_names') or not self.gene_names:
self.download_gff(url, file_name)
return self.gene_names

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that this function is not used anymore, since process_gff sets self.gene_names. RM

@@ -45,23 +45,32 @@ def load_sample_info(self, json_file_path):
self.treatments = [item['Treatment'] for item in sample_info]
self.experiment_id = [item['ExperimentID'] for item in sample_info]

def download_gff(self, url, file_path):
def download_gff(self, url, file_path):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rename to download_and_process_gff, since it does more than just download now.


self.gene_names = gene_info['GeneName'].tolist()
self.num_genes = len(self.gene_names)
return output_file_csv, file_path

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return values are not used

if not os.path.exists(file_path):
subprocess.run(['wget', '-O', file_path, url], check=True)
self.process_gff(file_path)

def process_gff(self, file_path):
Copy link

@v-rocheleau v-rocheleau Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rename to something like write_gene_info, since the function now saves the gene lengths in a csv.

print(f"Gene lengths have been saved to {output_file_csv}.")

self.gene_names = gene_info['GeneName'].tolist()
self.num_genes = len(self.gene_names)
Copy link

@v-rocheleau v-rocheleau Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this kind of situation, when you need to hold a list of values (gene_names), and make decisions based the length of the list, it is best to simply get the length directly from the property when you need it. With self.gene_names, all we need to get the number of genes is len(self.gene_names).

Otherwise, you have to maintain the state of 2 closely linked properties instead of just 1, which is not ideal for maintainability.

Comment on lines +103 to +108
if not self.gene_names:
raise ValueError("No gene names available for count matrix generation.")

unique_gene_names = list(filter(None, set(self.gene_names)))
if not unique_gene_names:
raise ValueError("No valid gene names available after filtering duplicates and empty entries.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed, since the values are already deduplicated when the self.gene_names property is set.

genes.loc[:, 'length'] = genes['end'].astype(int) - genes['start'].astype(int) + 1
genes = genes['GeneName'].dropna().tolist()
return genes
gene_info = genes[['GeneName', 'length']].dropna().drop_duplicates(subset='GeneName')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns should be GeneID and GeneLength for TDS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants